An Automatic Closed-Loop Methodology for Generating Character
Abstract
Character groundtruth for real, scanned document images is extremely useful for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not possible because (i) the accuracy achieved when delineating groundtruth character bounding boxes by hand is not high enough, (ii) the work is extremely laborious and time-consuming, and (iii) the manual labor required is prohibitively expensive. In this paper we give a closed-loop methodology for collecting very accurate groundtruth (within a pixel of error) for scanned documents. We first create ideal documents using a typesetting language. Next we create the groundtruth for the ideal document. The ideal document is then printed, photocopied, and scanned. A registration algorithm estimates the global geometric transformation and then performs a robust local bitmap match to register the ideal document image to the scanned document image. Finally, the groundtruth associated with the ideal document image is transformed using the estimated geometric transform to create the groundtruth for the scanned document image. This methodology is very general and can be used to create groundtruth for documents typeset in any language, layout, font, and style. We have demonstrated the method by generating groundtruth for scanned and FAXed images. The cost of creating groundtruth using our methodology is minimal. If character, word, or zone groundtruth is available for any real document, the registration algorithm can be used to generate the corresponding groundtruth for a rescanned version of the document.
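The last step of the pipeline described in the abstract, carrying groundtruth boxes from the ideal image to the scanned image through an estimated global transform, can be illustrated with a minimal sketch. It assumes an affine model fitted by least squares to matched feature points; the abstract does not say which transform family or estimator the authors use, and the names `estimate_affine` and `transform_box` are hypothetical. The robust local bitmap match is not covered.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Affine = Tuple[Tuple[float, ...], Tuple[float, ...]]  # ((a, b, tx), (c, d, ty))


def solve_3x3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate (collinear) point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def estimate_affine(ideal: Sequence[Point], scanned: Sequence[Point]) -> Affine:
    """Least-squares affine map from ideal points to scanned points.

    Assumed model (not taken from the paper):
    x' = a*x + b*y + tx,  y' = c*x + d*y + ty.
    """
    if len(ideal) != len(scanned) or len(ideal) < 3:
        raise ValueError("need at least three matched point pairs")
    # Normal equations (A^T A) p = A^T u, where each row of A is (x, y, 1).
    ata = [[0.0] * 3 for _ in range(3)]
    atu = [0.0] * 3
    atv = [0.0] * 3
    for (x, y), (u, v) in zip(ideal, scanned):
        row = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atu[i] += row[i] * u
            atv[i] += row[i] * v
    return tuple(solve_3x3(ata, atu)), tuple(solve_3x3(ata, atv))


def transform_box(box: Box, t: Affine) -> Box:
    """Map the four corners of an ideal-image box and return their
    axis-aligned bounding box in scanned-image coordinates."""
    (a, b, tx), (c, d, ty) = t
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x0, y1), (x1, y1)]
    xs = [a * x + b * y + tx for x, y in corners]
    ys = [c * x + d * y + ty for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

In this sketch the matched points would come from features located in both images, such as page corners or character centroids. A projective model or a robust estimator may be more appropriate for real scans; the affine fit is chosen here only to keep the example short.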
Similar resources
An Automatic Closed-loop Methodology for Generating Character Groundtruth for Scanned Documents
Character groundtruth for real, scanned document images is crucial for evaluating the performance of OCR systems, training OCR algorithms, and validating document degradation models. Unfortunately, manual collection of accurate groundtruth for characters in a real (scanned) document image is not practical because (i) accuracy in delineating groundtruth character bounding boxes is not high enoug...
An Optimization Model for Multi-objective Closed-loop Supply Chain Network under Uncertainty: A Hybrid Fuzzy-stochastic Programming Method
In this research, we address the application of uncertainty programming to design a multi-site, multi-product, multi-period, closed-loop supply chain (CLSC) network. In order to make the results of this article more realistic, a CLSC for a case study in the iron and steel industry has been explored. The presented supply chain covers three objective functions: maximization of profit, minimization of n...
A Point Matching Algorithm for Automatic Generation of Groundtruth for Document Images
Geometric groundtruth at the character, word, and line level is crucial for developing and evaluating optical character recognition (OCR) algorithms. Kanungo and Haralick [ICPR ’96] proposed a closed-loop methodology for generating character-level groundtruth for rescanned images. In this article we present a robust version of their methodology. We grouped the feature points and used branch and boun...
An Inexact-Fuzzy-Stochastic Optimization Model for a Closed Loop Supply Chain Network Design Problem
The development of optimization and mathematical models for closed loop supply chain (CLSC) design has attracted considerable interest over the past decades. However, the uncertainties that are inherent in the network design and the complex interactions among various uncertain parameters are challenging the capabilities of the developed tools. The aim of this paper, therefore, is to propose a n...
Publication date: 1998